Data Science in image classification - Dog breeds

Data Science Workshop

Students: Eran Kayat & Andrey Katunin

Who's a good dog? Who likes ear scratches? Well, it seems those fancy deep neural networks don't have all the answers. However, maybe they can answer that ubiquitous question we all ask when meeting a four-legged stranger: what kind of good pup is that?

Hello, we are Eran and Andrey, and we are doing our project on dog breed identification. For a little background: from a young age we both had a big love for two things, computers and our dogs. When we started working together, we thought this project would be a good opportunity to combine the two.

We found two datasets of images of dogs. Each image has a filename that is its unique id. Together the two datasets cover 120 breeds of dogs. The goal of the project is to create a classifier capable of determining a dog's breed from a photo.

We think we can create an application that lets you take a picture of a dog and tells you what breed it is.
We all know that feeling of seeing a dog and not knowing its specific breed.

Also, some dog breeds are illegal in certain places, so a model that recognizes dog breeds could help local authorities enforce the law.

Data Gathering

We downloaded two dog breed image datasets, one from Kaggle and the other from Stanford, and uploaded them to our Google Drive.

We will now unzip our images

Data Preparation

Importing the libraries we are going to use for our task.

Our images are now in two different directories, and the directory structure differs between them. We want to merge the datasets into one and continue from there.

We don't want to see any warnings, so we suppress them.

Kaggle Dataset

Let's start by loading the dataset from Kaggle.

We can see that this dataset has 10221 rows, and that each row contains the id of an image and the name of the dog breed.

Stanford Dataset

Now we will load the Stanford dataset.

We can see that we created a dataframe from the Stanford directories.

We have 20579 entries, each with an id and a dog breed name.

Merging the Datasets

Now we will merge the two datasets into one using an outer join.

After the merge we have 30802 rows.
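A minimal sketch of this kind of outer merge with pandas; the frame and column names here are hypothetical stand-ins for the real datasets.

```python
import pandas as pd

# Toy versions of the two dataframes; in the project each one
# has an image id and a breed name.
kaggle_df = pd.DataFrame({"id": ["a1", "a2"], "breed": ["beagle", "pug"]})
stanford_df = pd.DataFrame({"id": ["b1", "b2"], "breed": ["beagle", "basenji"]})

# An outer merge keeps every row from both datasets.
merged = pd.merge(kaggle_df, stanford_df, how="outer")
print(len(merged))  # 4
```

With no overlapping ids the outer join is effectively a concatenation of the two tables, which is why the row count is roughly the sum of the two datasets.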

EDA

Data Exploration

Let's see how many labels we have.

We can see that we have 120 different dog breeds

Let's give each breed a numeric label.

Now, for each dog in the dataset, let's add the path to its image.
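A small sketch of these two steps on a toy dataframe; the `images/` directory layout is an assumption for illustration.

```python
import pandas as pd

# Toy dataframe standing in for the merged dataset.
df = pd.DataFrame({"id": ["a1", "a2", "b1"],
                   "breed": ["beagle", "pug", "beagle"]})

# Give each breed an integer label.
breeds = sorted(df["breed"].unique())
breed_to_label = {b: i for i, b in enumerate(breeds)}
df["label"] = df["breed"].map(breed_to_label)

# Add the path to each image (hypothetical directory layout).
df["path"] = "images/" + df["id"] + ".jpg"
print(df[["id", "label", "path"]])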

Let's look at the data we have.

We have 30802 images split among 120 categories.
Now let's check the distribution.

Dropping rows we don't have labels for (152 images dropped).

We can see that some categories contain much less data than others. This might affect the accuracy of the model. We will leave it for now and maybe look for more data later.

We can see that the resolution of the images is not too high and that the distribution is around 400 pixels on each side.

Image Analysis

As we can see from the pictures above, these dogs are the same breed, but the colors of the dogs and their poses might prevent the algorithms from classifying them correctly.

Let's take a look at two other images.

This time, the pictures are of two different breeds, but they look very alike, so the algorithms might get the classification wrong and decide they are the same breed.

Classical Machine Learning

Multinomial Naive Bayes

Now we will create a copy of the dataframe to perform feature extraction.

Drop the id and breed, as those are not relevant features.

Now we will convert the images to feature vectors using OpenCV and then try to classify the images using MultinomialNB.

new_df is a dataframe with the pixels as the columns.

Now that we have all the images as pixel vectors, we can start using classification algorithms.

Here we defined our features and labels (X, y).

Splitting the data into train and test sets.

Because we don't have enough memory to train on the whole dataset at once, we need algorithms that support partial fitting. That is why we decided to start with MultinomialNB.

Now we iterate over our training data, dividing the pixel values by 255 so they are normalized, and then partially fit the model.
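A sketch of this incremental training loop. The batches here are random stand-ins; in the project the batches come from the pixel dataframe and there are 120 classes rather than 3.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
clf = MultinomialNB()
classes = np.arange(3)  # in the project there are 120 breed labels

# Hypothetical batches standing in for chunks of the pixel dataframe.
for i in range(5):
    X_batch = rng.integers(0, 256, size=(64, 3072))
    y_batch = rng.integers(0, 3, size=64)
    # Scale pixels to [0, 1]; the full class list must be given
    # on the first partial_fit call.
    clf.partial_fit(X_batch / 255.0, y_batch,
                    classes=classes if i == 0 else None)

preds = clf.predict(rng.integers(0, 256, size=(10, 3072)) / 255.0)
print(preds.shape)  # (10,)
```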

Getting our predictions.

We can see that we got about 4.5% accuracy on the test data, which is very low and would not suit a production application.

Ideally, at 100% accuracy, we would see the whole diagonal in a light color and the rest black; for now the picture is quite chaotic.
Hopefully we can improve in the future using more advanced techniques.

We can see that we got about 4 percent accuracy, which is quite bad. Now we will try to improve the accuracy by adding Gabor features.

Gabor + SVM

Before, we tried to use the whole dataset and ran into memory issues; this time we will only use as many images per category as there are in the smallest category.

We have 216 images in the smallest category, so we will reduce the number of images in every category to 216.

Basically, a Gabor filter analyzes whether there is any specific frequency content in the image in specific directions in a localized region around the point of analysis.

Now we will show an example of a Gabor filter. We can see that in the second picture the whole image is blurred and only the faces remain visible.

Now we will resize the images to 64 by 64 pixels because of memory restrictions and extract five sets of different Gabor features.

We will try to use support vector machines on our data for classification.


We got 25% accuracy, which is much better than before but still not enough, and we think we can do much better.

Here we can see that a diagonal is starting to form, but it is still not enough for us.

Gabor + Random Forest

We will also try a Random Forest.


We got 44% accuracy, which is even better than the SVM approach.

Transfer Learning

We will now run a pretrained VGG-16 network on our images, extract the activations of its last layers as features, and then use ML algorithms on the extracted features.


Extracting features with VGG-16
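A minimal sketch of this kind of feature extraction with Keras. `weights=None` is used here only so the sketch needs no download; in practice the network would be loaded with `weights="imagenet"`, and the image would come from disk rather than random noise.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# VGG-16 without its classification head; pooling="avg" turns the last
# convolutional feature maps into one 512-dimensional vector per image.
model = VGG16(weights=None, include_top=False, pooling="avg",
              input_shape=(224, 224, 3))  # weights="imagenet" in practice

# One synthetic image standing in for a loaded and resized dog photo.
batch = preprocess_input(np.random.rand(1, 224, 224, 3) * 255.0)
features = model.predict(batch, verbose=0)
print(features.shape)  # (1, 512)
```

The resulting 512-dimensional vectors are what the Random Forest and XGBoost models below are trained on.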

VGG_16 + Random Forest 100 estimators

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    vgg16_feature_list_np, y_vgg, test_size=0.25, random_state=42)
```

Wow, with VGG-16 + Random Forest with 100 estimators we got 73 percent accuracy.

This looks much better; with each run we are getting better results and the diagonal looks much more defined.

Let's also see how we are doing per category.

VGG_16 + Random Forest 200 estimators

We will also try with 200 and then 400 estimators.

VGG_16 + Random Forest 400 estimators

We can see that with more estimators we do better, but the more we increase the number of estimators, the smaller the improvement becomes.

Looking at the heatmap, we can see that eskimo_dog was classified as malamute 26% of the time, and the two breeds do indeed look very similar.

We suspected this would happen because the dogs really are similar.

VGG_16 + XGBoost

With XGBoost we got slightly less accurate results, even though it trained for much longer.
Maybe we could increase the number of estimators, but we decided to try something else, as the training time is very long.

Neural Networks

Now we will reorganize the data for easier access from ImageDataGenerator.

Create a directory for every breed.

Copy all images to their corresponding directories.

With Data Augmentation

We are using data augmentation to artificially expand the size of the training dataset by creating modified versions of the images in it.

We can see that after we apply the augmentation, the dog image is rotated, zoomed, and sheared in different ways.
This allows us to get more from less data.
It also lets us give the model slightly different data in each epoch to avoid overfitting.
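A sketch of such an augmentation pipeline with Keras' ImageDataGenerator; the transformation ranges here are illustrative, not the project's exact settings.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotation, zoom, shear, and horizontal flips (illustrative ranges).
datagen = ImageDataGenerator(rotation_range=30, zoom_range=0.2,
                             shear_range=0.2, horizontal_flip=True)

# One synthetic image standing in for a dog photo.
img = np.random.rand(1, 64, 64, 3)
it = datagen.flow(img, batch_size=1)
augmented = next(it)  # a differently transformed copy each time it is drawn
print(augmented.shape)  # (1, 64, 64, 3)
```

Because the generator transforms images on the fly, each epoch sees a slightly different version of every training image.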

First CNN

We will first try our own CNN and see if a simple CNN can accurately predict the dog breeds.

Here we define a neural network with 7 convolutional layers and one fully connected layer.

We are using the Adam optimizer.
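A reduced sketch of this kind of network, with fewer convolutional layers than the seven described and illustrative filter counts and input size, not the project's exact architecture.

```python
from tensorflow.keras import layers, models

# A few convolution/pooling blocks followed by one fully connected
# layer and a softmax over the 120 breeds.
model = models.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(120, activation="softmax"),  # one output per breed
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 120)
```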


We trained for 50 epochs (the model has seen the data 50 times) and we see that the best accuracy on the validation set is at most 35%,
which is worse than our other models.

We can see that we are overfitting the data, as the training accuracy line (dotted) diverges from the validation line (continuous).

Xception

We use a pretrained Xception network and then fine-tune it on our own data.

Defining the neural network.
The first layers are those of the Xception network pretrained on ImageNet, followed by a fully connected layer with 512 nodes.
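A sketch of this architecture. `weights=None` keeps the sketch self-contained (no download); in practice the base would be loaded with `weights="imagenet"`, and the input size here is illustrative.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

# Pretrained convolutional base (weights="imagenet" in practice) plus a
# 512-node fully connected layer and a softmax over the 120 breeds.
base = Xception(weights=None, include_top=False, pooling="avg",
                input_shape=(160, 160, 3))
model = models.Sequential([
    base,
    layers.Dense(512, activation="relu"),
    layers.Dense(120, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
print(model.output_shape)  # (None, 120)
```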

Without data augmentation


We can see that without data augmentation we quickly overfit the data: after 3 epochs we already reached 96 percent accuracy on the training data
and only 50 percent on the validation data. Even though the results are better than the model with augmentation, this model overfits and does not improve with more epochs.

Using pretrained convolutional layers

Defining the neural network.

We can see that after 5 epochs we come close to the maximum validation accuracy, and the graph shows how quickly we overfit the data. We think using data augmentation is better because it helps us avoid overfitting, and if we wanted to deploy a model to a production application, it would be better to have a model that has seen more varied data rather than one that saw the same data many times.

ViT transformer

After reading this paper: https://openreview.net/forum?id=YicbFdNTTy
we decided to try a visual transformer approach.
Historically, the best performing models for image classification have been deep convolutional networks like ResNet, Xception, and others. This paper proposes a different approach that does not rely on the convolution operator; traditionally, this kind of architecture was used for NLP applications.

The steps are:

  1. Split an image into patches
  2. Flatten the patches
  3. Produce lower-dimensional linear embeddings from the flattened patches
  4. Add positional embeddings
  5. Feed the sequence as an input to a standard transformer encoder
  6. Pretrain the model with image labels (fully supervised on a huge dataset)
  7. Finetune on the downstream dataset for image classification
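The first few steps above can be sketched in plain NumPy; the sizes are toy values and the projection is a random, untrained stand-in for the learned patch embedding.

```python
import numpy as np

# Step 1-2: split a toy 64x64 image into 16x16 patches and flatten each.
image = np.random.rand(64, 64, 3)
patch = 16
patches = []
for i in range(0, 64, patch):
    for j in range(0, 64, patch):
        patches.append(image[i:i + patch, j:j + patch].reshape(-1))
patches = np.stack(patches)  # (16, 768): 16 patches of 16*16*3 values

# Step 3-4: a (random, untrained) linear projection stands in for the
# patch embedding; positional embeddings are added to the result.
W = np.random.rand(patch * patch * 3, 64)
pos = np.random.rand(len(patches), 64)
tokens = patches @ W + pos   # the sequence fed to the transformer encoder
print(tokens.shape)  # (16, 64)
```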

The main advantage of this approach is lower computational requirements.


We can already see one advantage of ViT here.
We are using a similar image data generator as before, but this time we can use a batch size of 32.
Before, we could not do that, as we did not have enough memory on the GPU.

Downloading a ViT model pretrained on ImageNet.

We will use the ViT model as the first layer, followed by a fully connected layer with 256 nodes.

Defining the learning rate and then the optimizer.
We also defined a dynamic learning rate, which means that if we reach a plateau, the learning rate is decreased to try to improve further.
We defined early stopping, which stops the learning process if there has been no improvement for 5 epochs,
and also a checkpoint to save the best weights so far.
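A sketch of this callback setup with Keras; the patience values, reduction factor, and checkpoint path are illustrative, not necessarily the project's settings.

```python
from tensorflow.keras.callbacks import (ReduceLROnPlateau, EarlyStopping,
                                        ModelCheckpoint)

callbacks = [
    # Shrink the learning rate when the validation loss plateaus.
    ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=2),
    # Stop training after 5 epochs without improvement.
    EarlyStopping(monitor="val_loss", patience=5,
                  restore_best_weights=True),
    # Keep the best weights seen so far on disk (hypothetical path).
    ModelCheckpoint("best_model.keras", monitor="val_loss",
                    save_best_only=True),
]
print(len(callbacks))  # 3
```

The list would be passed as `callbacks=callbacks` to `model.fit(...)`.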

We can see that even though the ViT performed a little worse than the Xception network (90% vs 88%), the training took much less time (7 hours vs 2.5 hours).

We think we could tune the hyperparameters of the ViT model further and perhaps match the results of the Xception network.

Ideas For Improvement

  1. We had a problem with Chihuahuas: we noticed too late that half of the Chihuahua pictures were not copied to the right directory, which would explain the low results we got for Chihuahuas on all models. This was noticed late and we did not have time to retrain the models.
  2. We could also try to add more data and see if we can improve the accuracy.
  3. We did not play much with hyperparameters, as training times were really long, sometimes more than 15 hours, so in our opinion we could probably improve some of the models.
  4. We had an idea we did not have time to implement: using a neural network to locate the dog's face in the image, crop the image around the face, and run classification on the cropped image.
  5. We did not use autoencoders; perhaps they could have helped.

Conclusions

Results by model/algorithm

Naive Bayes - As predicted, it is a very poor model, but it was helpful as a first attempt, with 4% accuracy. 3 hours training + 3 hours loading the data into a dataframe.

Gabor+SVM - With Gabor filters we got much better accuracy, but it still was not good enough, at 24% accuracy. 3 hours training + 2 hours to prepare the dataframe.

Gabor+Random Forest - We got 44% accuracy, which is even better than the SVM approach. 1 hour training + 2 hours to prepare the dataframe.

VGG-16 + Random Forest, 100-400 estimators - Using VGG-16 as a feature extractor and applying a Random Forest worked a lot better than our previous models (73%, 75%, 77%). The fastest runs took about 2 hours to train, including feature extraction.

VGG16 + XGBoost - Another try with VGG16, this time with XGBoost, performed worse than the Random Forests, with 72% accuracy. 15 hours.

Classic CNN with data augmentation - We used our own CNN, which produced poor results at 34% accuracy. 7 hours.

Xception with data augmentation - This pretrained CNN worked much better, with 91% accuracy. 7 hours.

Classic CNN without data augmentation - Trained faster without augmentation but overfitted the data. 50% accuracy, 7 hours.

Xception without data augmentation - Performed similarly to the version with augmentation. 90% accuracy, 7 hours.

ViT with data augmentation - Trained much faster than Xception and got similar results. 88% accuracy, 2.5 hours.

Struggles

  1. We wish we could have used better hardware. It was impossible to load all the images into a dataframe at high quality because of lack of memory. We tried using Colab, but the lack of memory and constant disconnections made us swiftly switch to our own hardware. Even there, lack of memory and GPU power caused frequent computer crashes, and long runtimes forced us to make a lot of adjustments.
  2. We were too ambitious in trying to use two different datasets, which raised a lot of problems we had to fix before even starting to run the main algorithms on the data. Because of the size of the data, we had memory issues and slow running times.

Self reflection

Eran: